require(plotly)
require(hierfstat)
require(adegenet)
require(PopGenReport)
require(pheatmap)
require(tidyverse)
require(DiagrammeR)
require(poppr)
require(genepop)
require(graph4lg)
require(knitr)
require(pegas)

Readme

This is an rstudio project. If you’d like to pre-rendered figures, read a summary of analysis and view code, please open the html file in a browser.

To conduct the analyses on your computer, edit or run code: clone this repository into a directory on you r local machine and open the .Rproj file in Rstudio. All data and analyses are available in the github repository at https://github.com/david-dayan/rogue_half_pounder.git

Rationale

ODFW has proposed using half pounder abundance as the sole index for determining sliding scale fishing regulations for winter steelhead on the Rogue as (i) half pounder abundance should integrate juvenile freshwater and early ocean conditions for steelhead regardless of whether they express the half pounder phenotype and (ii) half pounder abundance is predictive of steelhead abundance in historical dam passage counts. However, the relative proportion of winter vs summer run life histories expressed by half pounders is unknown. Furthermore there is limited genetic information about the three main runs of steelhead on the Rogue (winter early-summer and late-summer (sometimes called fall))

This study attempts to use neutral and adaptive GTseq markers to examine population structure and run timing among Rogue River steelhead

Data Summary

Samples

Half Pounders
Samples were collected during ODFW’s Lower Rogue Seining Project in 2018 and 2019. The Lower Rogue Seining Project estimates escapement for Coho, late-run summer steelhead, half-pounder steelhead and fall chinook by beach seining near Huntley Park at approximate river mile 8, three times weekly from July through October. Half-pounder steelhead were identified as individuals with fork length 250 - 410mm and sampled in batches of up to 50 fish each day for 11 days from September 7th to October 1st 2018 totaling 384 individuals and 18 days from August 14th to September 25th 2019 totaling 331 individuals. Caudal fin clips were taken for DNA extraction, and placed in daily batch vials containing 95% ethanol. Note that due to batch collection of fin clips, that these numbers are inflated (some individuals have multiple tissue samples - identified later and filtered)

Also ~5% of samples represented twice in GTseq library as QAQC samples

Adults
45 winter and 45 early run summer run fish.. Adult summer steelhead were sampled at the Cole River Hatchery sorting pond on 6/26/2019 and (b) adult winter steelhead were sampled at Cole Rivers Hatchery (Rogue River) and the Applegate River from adult brood stock for 2019. Finally 166 additional late-run summer fish (fall run) were sampled at the Huntley Park seine on the lower Rogue.

Genotype Data

Information about sequencing data, genotype calling and filtering is available in the relevant R notebook (half_pounder_genotyping_notebook.Rmd) in this repository.

load("./genotype_data/genind_2.0.R")
load("./genotype_data/genotypes_2.2.R")
genos_2.2 <- ungroup(genos_2.2)
kable(genos_2.2 %>%
  group_by(run, year) %>%
  tally(), caption = "Final Filtered Dataset")
Final Filtered Dataset
run year n
fall 2019 157
halfpounder 2018 338
halfpounder 2019 305
Summer 2019 42
Winter 2019 40

Workflow

Our primary goals are to (a) examine population structure within and among the three run timing groups in the Rogue River (early summer, late summer and winter runs) (late summer referred to as fall in notebook) and (b) classify half-pounders into run timing group. To begin, we will collect nucleotide diversity index at he marker and population levels, examine population structure at neutral and adaptive markers using multivariate and bayesian approaches, and assign half pounders to run timing groups based on a genetic classifier.

Diversity Metrics

Here we calculate both per-marker and per-population diversity metrics (Ho He HWE Fis and Fst)

Heterozygosity

# first Ho and He
n.pop <- seppop(genind_2.0)

hobs <- lapply(n.pop, function(x) (summary(x)$Hobs))
hobs <- as.data.frame(t(do.call(rbind, hobs)))
hobs <- hobs %>%
  rownames_to_column(var="marker")
hobs <- hobs %>%
  pivot_longer(-marker, names_to = "run", values_to = "Ho")
#ggplot(hobs)+geom_boxplot(aes(x=pop, y=hobs))+theme_classic()+xlab("Run Timing")+ylab("Observed Heterozygosity")

hexp <- lapply(n.pop, function(x) (summary(x)$Hexp))
hexp <- as.data.frame(t(do.call(rbind, hexp)))
hexp <- hexp %>%
  rownames_to_column(var="marker") %>%
  pivot_longer(-marker, names_to = "run", values_to = "He")

marker_divs <- hexp %>%
  left_join(hobs)
## Joining, by = c("marker", "run")
#now lets add the annotation status (neutral vs adaptive)

marker_mapping <- readxl::read_xlsx("./metadata/final_mapping_results.xlsx", sheet = 1)

marker_divs <- marker_mapping %>%
  select(marker, `Presumed Type`) %>%
  mutate(marker = str_replace(marker, "Omy(\\d+)", "Chr\\1")) %>% #marker name convention is different
  mutate(neutral = if_else(str_detect(`Presumed Type`, 'Adaptive|adaptive'), "adaptive", "neutral")) %>%
  right_join(marker_divs)
## Joining, by = "marker"
# lets add known run-timing marker as a grouping variable, and conver to longer format
marker_divs <-  marker_divs %>%
  mutate(run_timing_marker = if_else(str_detect(`Presumed Type`, 'Run|run'), "run" , "non-run")) %>%
   pivot_longer(c("Ho", "He"), names_to = "obs_exp", values_to ="H")

#nice, now lets make a long table


# lets throw up some nice plots here
# first for all markers
ggplot(data=marker_divs)+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d()+theme_classic()+ggtitle("all markers")

ggplot(data=marker_divs[marker_divs$neutral=="neutral",])+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d()+theme_classic()+ggtitle("neutral markers")
## Warning: Removed 88 rows containing non-finite values (stat_boxplot).

ggplot(data=marker_divs[marker_divs$neutral=="adaptive",])+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d()+theme_classic()+ggtitle("adaptive markers")
## Warning: Removed 88 rows containing non-finite values (stat_boxplot).

ggplot(data=marker_divs[marker_divs$run_timing_marker=="run",])+geom_boxplot(aes(x=run, y=H, fill = obs_exp))+scale_fill_viridis_d()+theme_classic()+ggtitle("run-timing markers")
## Warning: Removed 88 rows containing non-finite values (stat_boxplot).

Are these differences between populations significant? What about differences between nuetral and adaptive sets of loci. To test with the same number of loci (across populations) we’ll use a monte-carlo test and 999 permutations. To test across different sets of loci (neutral vs adaptive), we’ll use a simple t-test

#lets make a way to easily called different snp classes 
#"neutral" markers 

neutral_loci_names <- marker_mapping %>%
  filter(str_detect(`Presumed Type`, 'neutral|Neutral')) %>%
  dplyr::select(marker)

#different naming convention, lets fix
neutral_loci_names <- str_replace(neutral_loci_names$marker, "Omy28", "Chr28")

#run timing markers
run_timing_loci_names <- marker_mapping %>%
  filter(chr == "28" | CRITFC_chromosome == "28") %>%
  filter(str_detect(`Presumed Type`, 'run|Run')) %>%
  select(marker)

#different naming convention, lets fix
run_timing_loci_names <- str_replace(run_timing_loci_names$marker, "Omy28", "Chr28")

## test for all pairwise differences at run-timing markers
#make a pop separted genind at run timing markers
chr28_seppop  <- seppop(genind_2.0[loc=run_timing_loci_names])
## Warning: the following specified loci do not exist: Omy_RAD52458-17
# fall-half
Hs.test(chr28_seppop$fall, chr28_seppop$halfpounder)
## Monte-Carlo test
## Call: Hs.test(x = chr28_seppop$fall, y = chr28_seppop$halfpounder)
## 
## Observation: 0.0009344338 
## 
## Based on 999 replicates
## Simulated p-value: 1 
## Alternative hypothesis: two-sided 
## 
##    Std.Obs.y  Expectation     Variance 
## 5.492942e-04 9.299128e-04 6.774349e-05
#fall-summer
Hs.test(chr28_seppop$fall, chr28_seppop$Summer)
## Monte-Carlo test
## Call: Hs.test(x = chr28_seppop$fall, y = chr28_seppop$Summer)
## 
## Observation: 0.3674184 
## 
## Based on 999 replicates
## Simulated p-value: 0.001 
## Alternative hypothesis: two-sided 
## 
##    Std.Obs.y  Expectation     Variance 
## 2.905814e+01 3.008754e-03 1.572694e-04
#fall-winter
Hs.test(chr28_seppop$fall, chr28_seppop$Winter)
## Monte-Carlo test
## Call: Hs.test(x = chr28_seppop$fall, y = chr28_seppop$Winter)
## 
## Observation: 0.2931165 
## 
## Based on 999 replicates
## Simulated p-value: 0.001 
## Alternative hypothesis: two-sided 
## 
##    Std.Obs.y  Expectation     Variance 
## 1.252714e+01 3.186370e-03 5.356519e-04
#half-summer
Hs.test(chr28_seppop$halfpounder, chr28_seppop$Summer)
## Monte-Carlo test
## Call: Hs.test(x = chr28_seppop$halfpounder, y = chr28_seppop$Summer)
## 
## Observation: 0.366484 
## 
## Based on 999 replicates
## Simulated p-value: 0.001 
## Alternative hypothesis: two-sided 
## 
##    Std.Obs.y  Expectation     Variance 
## 2.988719e+01 4.553597e-03 1.466493e-04
#half-winter
Hs.test(chr28_seppop$halfpounder, chr28_seppop$Winter)
## Monte-Carlo test
## Call: Hs.test(x = chr28_seppop$halfpounder, y = chr28_seppop$Winter)
## 
## Observation: 0.2921821 
## 
## Based on 999 replicates
## Simulated p-value: 0.001 
## Alternative hypothesis: two-sided 
## 
##    Std.Obs.y  Expectation     Variance 
## 1.389996e+01 5.399424e-03 4.256757e-04
#summer-winter
Hs.test(chr28_seppop$Summer, chr28_seppop$Winter)
## Monte-Carlo test
## Call: Hs.test(x = chr28_seppop$Summer, y = chr28_seppop$Winter)
## 
## Observation: -0.07430186 
## 
## Based on 999 replicates
## Simulated p-value: 0.001 
## Alternative hypothesis: two-sided 
## 
##     Std.Obs.y   Expectation      Variance 
## -5.4064250597 -0.0022657271  0.0001775335
## Next lets look if there is increased/decreased diversity within a population at these markers compared to neutral
neutral_seppop  <- seppop(genind_2.0[loc=neutral_loci_names])
## Warning: the following specified loci do not exist: Omy_BAMBI2.312,
## Omy_118205-116, Omy_G3PD_2.246, Omy_u09-53.469, Omy_u09-61.043, Omy_pad-196,
## Omy_97865-196, M09AAE.082, Omy_ftzf1-217, Omy_BAMBI4.238, Omy_BAC-F5.284,
## OMS00074, Omy_ndk-152, Omy_Ogo4-212, M09AAJ.163, OMS00018, Omy_u09-56.119,
## M09AAD.076, Omy_Il-1b_.028, M09AAC.055, Omy_104569-114, OMS00173, Omy_impa1-55,
## Omy_u09-52.284, Omy_gsdf-291
#fall
t.test(summary(chr28_seppop$fall)$Hexp, summary(neutral_seppop$fall)$Hexp, alternative = "greater")
## 
##  Welch Two Sample t-test
## 
## data:  summary(chr28_seppop$fall)$Hexp and summary(neutral_seppop$fall)$Hexp
## t = 1.3138, df = 14.23, p-value = 0.1049
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -0.02162249         Inf
## sample estimates:
## mean of x mean of y 
## 0.3754156 0.3116535
#half
t.test(summary(chr28_seppop$halfpounder)$Hexp, summary(neutral_seppop$halfpounder)$Hexp, alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  summary(chr28_seppop$halfpounder)$Hexp and summary(neutral_seppop$halfpounder)$Hexp
## t = 1.3554, df = 14.338, p-value = 0.9019
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf 0.1430239
## sample estimates:
## mean of x mean of y 
## 0.3744812 0.3122231
#summer
t.test(summary(chr28_seppop$Summer)$Hexp, summary(neutral_seppop$Summer)$Hexp, alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  summary(chr28_seppop$Summer)$Hexp and summary(neutral_seppop$Summer)$Hexp
## t = -22.456, df = 84.793, p-value < 2.2e-16
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##        -Inf -0.2771162
## sample estimates:
##   mean of x   mean of y 
## 0.007997247 0.307277133
#winter
t.test(summary(chr28_seppop$Winter)$Hexp, summary(neutral_seppop$Winter)$Hexp, , alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  summary(chr28_seppop$Winter)$Hexp and summary(neutral_seppop$Winter)$Hexp
## t = -5.9751, df = 14.956, p-value = 1.288e-05
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -0.163543
## sample estimates:
##  mean of x  mean of y 
## 0.08229911 0.31376560

All pairwise population comparisons of He at chr28 are significantly different (p=0.001) EXCEPT fall-halfpounder. Winter and Summer populations demonstrate significantly reduced He at chr28 markers relative to neutral markers.

Are there significant departures from HWE at the loci level?

# here we use the hw.test function from pegas (exact test based on Monte Carlo permutations of alleles, 1000 permutations)
HWE.test <- lapply(n.pop, function(x) hw.test(x, B=1000))
# here we take the list of dataframes of p-values and combine into a single dataframe
hwe <- reduce(HWE.test, cbind)
hwe <- hwe[,c(4,8,12,16)]
colnames(hwe) <- c("fall", "halfpounder", "summer", "winter")
hwe <- as.data.frame(hwe)

# next we correct for multiple comparisons
p.adj <- as.data.frame(apply(hwe, MARGIN = 2, function(x) p.adjust(x, "fdr")))
hwe_exceed <- p.adj %>% rownames_to_column(var="marker") %>%
  pivot_longer(-marker, names_to = "run", values_to = "fdr") %>%
  filter(fdr < 0.1)

hwe_exceed%>%
  group_by(run) %>%
  tally()

Yes, see table above for number of SNPs outside of HWE (fdr < 0.1) per “population”

#lets check if there is an enrichment of run-timing and adaptive markers in markers out of HWE in fall and half-pounders
hwe_exceed <- hwe_exceed %>%
  left_join(pivot_wider(marker_divs, names_from = obs_exp, values_from = H))
## Joining, by = c("marker", "run")
#lets get marker info, but only for markers in the panel
marker_mapping2 <- marker_mapping %>%
  mutate(marker = str_replace(marker, "Omy(\\d+)", "Chr\\1")) %>%
  filter(marker %in% colnames(genos_2.2))

# halfpunder enriched for run timing markers
a <- nrow(hwe_exceed[hwe_exceed$run=="halfpounder" & hwe_exceed$run_timing_marker=="run",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="halfpounder" & hwe_exceed$run_timing_marker!="run",])
d <- nrow(marker_mapping2)-b

fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix(c(a, b, c, d), nrow = 2)
## p-value = 1.119e-10
## alternative hypothesis: true odds ratio is less than 1
## 95 percent confidence interval:
##  0.0000000 0.2845718
## sample estimates:
## odds ratio 
##  0.1624992
#fall for run-timing markers
a <- nrow(hwe_exceed[hwe_exceed$run=="fall" & hwe_exceed$run_timing_marker=="run",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="fall" & hwe_exceed$run_timing_marker!="run",])
d <- nrow(marker_mapping2)-b

fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix(c(a, b, c, d), nrow = 2)
## p-value = 0.003405
## alternative hypothesis: true odds ratio is less than 1
## 95 percent confidence interval:
##  0.0000000 0.5597078
## sample estimates:
## odds ratio 
##  0.1493741
#halfpounder for adaptive markers
a <- nrow(hwe_exceed[hwe_exceed$run=="halfpounder" & hwe_exceed$neutral=="adaptive",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="halfpounder" & hwe_exceed$neutral!="adaptive",])
d <- nrow(marker_mapping2)-b

fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix(c(a, b, c, d), nrow = 2)
## p-value = 0.4036
## alternative hypothesis: true odds ratio is less than 1
## 95 percent confidence interval:
##  0.000000 1.406637
## sample estimates:
## odds ratio 
##   0.914517
#fall for adaptive markers
a <- nrow(hwe_exceed[hwe_exceed$run=="fall" & hwe_exceed$neutral=="adaptive",])
c <- nrow(marker_mapping2)-a
b <- nrow(hwe_exceed[hwe_exceed$run=="fall" & hwe_exceed$neutral!="adaptive",])
d <- nrow(marker_mapping2)-b

fisher.test(matrix(c(a,b,c,d), nrow=2), alternative = "less")
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix(c(a, b, c, d), nrow = 2)
## p-value = 0.9427
## alternative hypothesis: true odds ratio is less than 1
## 95 percent confidence interval:
##  0.000000 6.200099
## sample estimates:
## odds ratio 
##   2.027023

Markers significantly out of HWE (Ho < He: excess of homozygotes) are enriched for run-timing markers, but not adaptive markers in the fall and half-pounder groups.

Now let’s make a nice visual representation of this information (fall and half pounder demonstrate a dearth of heterozygotes at run ting markers)

plot_data_half <- marker_divs %>%
  filter(run == "halfpounder") %>%
  pivot_wider( names_from = obs_exp, values_from = H)

ggplot(plot_data_half)+geom_point(aes(He, Ho, color = run_timing_marker))+geom_abline(aes(intercept=0, slope=1))+coord_equal(ratio=1)+ylim(0,0.6)+scale_color_viridis_d()+theme_classic()+ggtitle("halfpounder")
## Warning: Removed 12 rows containing missing values (geom_point).

plot_data_fall <- marker_divs %>%
  filter(run == "fall") %>%
  pivot_wider( names_from = obs_exp, values_from = H)

ggplot(plot_data_fall)+geom_point(aes(He, Ho, color = run_timing_marker))+geom_abline(aes(intercept=0, slope=1))+coord_equal(ratio=1)+ylim(0,0.6)+scale_color_viridis_d()+theme_classic()+ggtitle("fall")
## Warning: Removed 12 rows containing missing values (geom_point).

It’s surprising that there is enrichment of run timing markers among markers out of HWE in the fall run from these figures, but double checked the code, still significant enriched… Meanwhile the results in halfpounders are very clear.

Heterozygosity results summary

At neutral markers, there is a slight reduction of observed from expected heterozygosity, most pronounced among summer run fish. At run timing markers, there is a significant reduction in diversity (Ho) relative to neutral markers among the summer run and winter run fish (one-sided t-test), and increased diversity at fall run and half-pounders (not significant). Furthermore, there is an excess of homozygosity at run timing associated SNPs in the half-pounders (but not fall fish) suggesting some structure within this group. Among markers with significant departures from HWE within fall and half-pounders, there is a significant enrichment of run-timing associated markers (p = p-value = 1.707e-10 and p = p-value = 0.02129, odds ratios 6.513 and 5.63, respectively fisher’s exact test), but not significant enrichment for markers annotated as adaptive overall.

Fst

Let’s move on to f-statistics. We’ll use hierfstats for most of this work, so the first step is to convert to a hierfstat format. Then we’ll calculate some basic statists and Fst (Nei)

fstat <- genind2hierfstat(genind_2.0)
colnames(fstat) <- c(pop, names(genind_2.0$loc.n.all))
# and also lets make datasets at different sets of loci, because hierfstat doesn't easily retain loci names for later use
fstat_neutral <- genind2hierfstat(genind_2.0[loc=neutral_loci_names])
## Warning: the following specified loci do not exist: Omy_BAMBI2.312,
## Omy_118205-116, Omy_G3PD_2.246, Omy_u09-53.469, Omy_u09-61.043, Omy_pad-196,
## Omy_97865-196, M09AAE.082, Omy_ftzf1-217, Omy_BAMBI4.238, Omy_BAC-F5.284,
## OMS00074, Omy_ndk-152, Omy_Ogo4-212, M09AAJ.163, OMS00018, Omy_u09-56.119,
## M09AAD.076, Omy_Il-1b_.028, M09AAC.055, Omy_104569-114, OMS00173, Omy_impa1-55,
## Omy_u09-52.284, Omy_gsdf-291
colnames(fstat_neutral) <- c(pop, names(genind_2.0[loc=neutral_loci_names]$loc.n.all))
## Warning: the following specified loci do not exist: Omy_BAMBI2.312,
## Omy_118205-116, Omy_G3PD_2.246, Omy_u09-53.469, Omy_u09-61.043, Omy_pad-196,
## Omy_97865-196, M09AAE.082, Omy_ftzf1-217, Omy_BAMBI4.238, Omy_BAC-F5.284,
## OMS00074, Omy_ndk-152, Omy_Ogo4-212, M09AAJ.163, OMS00018, Omy_u09-56.119,
## M09AAD.076, Omy_Il-1b_.028, M09AAC.055, Omy_104569-114, OMS00173, Omy_impa1-55,
## Omy_u09-52.284, Omy_gsdf-291
fstat_run_timing <- genind2hierfstat(genind_2.0[loc=run_timing_loci_names])
## Warning: the following specified loci do not exist: Omy_RAD52458-17
colnames(fstat_run_timing) <- c(pop, names(genind_2.0[loc=run_timing_loci_names]$loc.n.all))
## Warning: the following specified loci do not exist: Omy_RAD52458-17
#calculate datset wide basic stats
basicstats <- basic.stats(fstat)
basicstats_neutral <- basic.stats(fstat_neutral)
basicstats_run_timing <- basic.stats(fstat_run_timing)

basicstats$overall
##     Ho     Hs     Ht    Dst    Htp   Dstp    Fst   Fstp    Fis   Dest 
## 0.2801 0.2888 0.2974 0.0086 0.3003 0.0114 0.0288 0.0381 0.0304 0.0161
basicstats_neutral$overall
##     Ho     Hs     Ht    Dst    Htp   Dstp    Fst   Fstp    Fis   Dest 
## 0.3045 0.3135 0.3149 0.0014 0.3154 0.0018 0.0043 0.0058 0.0288 0.0026
basicstats_run_timing$overall
##     Ho     Hs     Ht    Dst    Htp   Dstp    Fst   Fstp    Fis   Dest 
## 0.1860 0.2118 0.3917 0.1799 0.4516 0.2399 0.4593 0.5311 0.1218 0.3043
basicstats$perloc$run_timing <- if_else(rownames(basicstats$perloc) %in% run_timing_loci_names, "run", "non-run")
basicstats$perloc$neutral <- if_else(rownames(basicstats$perloc) %in% neutral_loci_names, "neutral", "adaptive")

ggplot(basicstats$perloc)+geom_histogram(aes(x=Fst, fill=neutral))+theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(basicstats$perloc)+geom_histogram(aes(x=Fst, fill=run_timing))+theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#check if any non-run timing markers have high Fst
max((basicstats$perloc[basicstats$perloc$run_timing=="non-run",])$Fst, na.rm = TRUE)
## [1] 0.0486

There are some clear fst outlier loci in the dataset. These outliers are composed solely of known run-timing markers, but not all run-timing markers demostrate high differentiation.

Per group mean Fst

Let’s collect genetic distance info on pairs of pops as well (Weir and Cochran 1984)

genet.dist(fstat, method="WC84")
##                    fall halfpounder      Summer
## halfpounder 0.001180695                        
## Summer      0.041715265 0.042413548            
## Winter      0.022561337 0.016945631 0.097702619
genet.dist(fstat_neutral, method="WC84")
##                    fall halfpounder      Summer
## halfpounder 0.000193948                        
## Summer      0.009047995 0.008312525            
## Winter      0.003447739 0.002929805 0.010647062

Moderate differentiation between most pairwise pop comps (not fall and halfpounder), but as suspected, this is driven mostly by the run timing markers. When examining only neutral annotated loci, differentiation is extremely low (less than 1%). Also of note here is that both the neutral and total estimates of differentiation between early and late summer (fall) populations (4.2% and 0.9%, respectively) is quite high. Mean Fst over all loci is about half the level of differentation between early summer and winter populations (9.7%) and nearly the same when examining only neutral loci (1.1%). Indeed, the late summer run fish are demonstrate lower differentiation from winter run than early summer run fish at both neutral and all loci.

Population Structure

From the Fst estimation in the section above we have our first ideas about population structure: fall run and half pounder fish demonstrate little to no differentiation, and this group is less differentiated from winter run than summer run fish. In this section we will conduct several analyses to uncover population structure.

STRUCTURE

Here we use a bayesian, model based clustering algorithm (STRUCTURE) to infer population structure and estimate admixture proportions of individual samples.

First we need to get our dataset ready for structure: remove linked loci, convert to structure format.

# first lets calculate LD (dartR has a great (fast) ld estimator that works right on genind files, so let's use this)
ldreport <- dartR::gl.report.ld(genind_2.0, name = NULL, save = FALSE, nchunks = 2, ncores = 3, chunkname = NULL)
## Start to calculate LD for all pairs of loci...
## Using 3 cores in 2  chunks.
## Depending on the number of loci this may take a while...
## nchunks specifies the number of steps in the progress bar and the number of intermediate saves, but slows the computation a bit. nchunks = 1 is fastest.
## Seperate all 882 loci...
## Generate all possible pairs: 62835 ...
## Calculate LD for all pairs...
## 
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |======================================================================| 100%
## # Simulations: 62835 . Took 53 seconds.

We’ll prune (keep one) the dataset of any locus-pairs with r2 > 0.2, then convert to STRUCTURE format

unlinked_genind <- genind_2.0[loc=-unique(ldreport[ldreport$R2>0.2,]$loc2)]
rm(ldreport)
#note just sort of crashed through this with a text editor, not easily logged, but the general idea was transpose the data, split columns (diploid to dual haploid) then convert data to integers
df <- genind2df(unlinked_genind)
df <- as.data.frame(t(df))
write_tsv(df, "genotype_data/all.str.tmp")
df <- read_tsv("genotype_data/all.str.tmp", col_names = FALSE)
df <- t(df)
write_tsv(as.data.frame(df), "genotype_data/all.str", col_names = FALSE)

This removed 25 highly linked SNPs from the dataset.

Run Log

Structure was run in a GUI outside this computation notebook’s environment.
admixture model: admixture, with correlated allele frequency
burnin/mcmc: ran with k=1-3 for 100k iteration to check for convergence of alpha, strong convergence after a few hundred burn-in iterations, used 10,000/20,000 for final runs
replicates: did 10 replicates for k=1-7

best K was chosen by the evanno method, and estimated in structure harvester
replicate results within each K were clumpp’d using the clumpak algorithm on the clumpak webserver

Results

Here we visualize the structure results of the clumpp’d results of all K values

note to prep for Monday: delta K suggests best k is two, however, particularly at low differentiation, the delta k method is biased towards k=2 (cullingham 2020), seems like a good place to remind myself that k is a model that doesn’t always fully catpure biological realtiy and comparing results across different levels of k can proide interesting insights, particualrly int the case of low differentiation. Also a note that it might be good to review the literature again on bayesian clustering methods for low levels of differentation, e.g. waples and gaggioti2006 and latch 2006.

Best K

Best K was 2 according to the Evanno (delta K) method, however, it’s important to remember the bias toward k=2 when differentiation is low or there is no population structure using this method. Delta k literally cannot evaluate K=1.

delta K plot

delta K plot

Structure Plots

Next let’s take the clumppd results and make some publication-ready figures

#import clump results into dataframes
# results are in files k*/majorcluster/clumppfiles/clumpindoutput
# took these files and captured the relevant data with a text editor (original input files are a mess with multiple field separators) and saved to new files
k2 <- read_tsv("./structure/halfpounder/clumpak/formatted_results/k2.txt")
k3 <- read_tsv("./structure/halfpounder/clumpak/formatted_results/k3.txt")
k4 <- read_tsv("./structure/halfpounder/clumpak/formatted_results/k4.txt")
k5 <- read_tsv("./structure/halfpounder/clumpak/formatted_results/k5.txt")


plot_data <- k2 %>% 
  rownames_to_column(var="id") %>% 
  group_by(pop) %>%
  sample_n(40) %>% # sample the half_pounder and fall fish to smaller size for plot
  ungroup() %>%
  gather('cluster', 'prob', clust1:clust2) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

a <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(2))

plot_data <- k3 %>% 
  rownames_to_column(var="id") %>% 
  group_by(pop) %>%
  sample_n(40) %>% # sample the half_pounder and fall fish to smaller size for plot
  ungroup() %>%
  gather('cluster', 'prob', clust1:clust3) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

b <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(3))

plot_data <- k4 %>% 
  rownames_to_column(var="id") %>% 
  group_by(pop) %>%
  sample_n(40) %>% # sample the half_pounder and fall fish to smaller size for plot
  ungroup() %>%
  gather('cluster', 'prob', clust1:clust4) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

c <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(4))

plot_data <- k5 %>% 
  rownames_to_column(var="id") %>% 
  group_by(pop) %>%
  sample_n(40) %>% # sample the half_pounder and fall fish to smaller size for plot
  ungroup() %>%
  gather('cluster', 'prob', clust1:clust5) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

d <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white")) +
  scale_fill_manual(values = viridisLite::viridis(5))



cowplot::plot_grid(a,b,c,d, ncol=1)

plot_data <- k2 %>% 
  rownames_to_column(var="id") %>% 
 # sample the half_pounder and fall fish to smaller size for plot
  gather('cluster', 'prob', clust1:clust2) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()


a <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(2))

plot_data <- k3 %>% 
  rownames_to_column(var="id") %>% 
  gather('cluster', 'prob', clust1:clust3) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

b <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(3))

plot_data <- k4 %>% 
  rownames_to_column(var="id") %>% 
  gather('cluster', 'prob', clust1:clust4) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

c <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white"), strip.text.x = element_blank()) +
  scale_fill_manual(values = viridisLite::viridis(4))

plot_data <- k5 %>% 
  rownames_to_column(var="id") %>% 
  gather('cluster', 'prob', clust1:clust5) %>%
  group_by(id) %>% 
  mutate(likely_assignment = cluster[which.max(prob)],
         assingment_prob = max(prob)) %>% 
  arrange(likely_assignment, desc(assingment_prob)) %>% 
  ungroup()

d <- ggplot(plot_data, aes(id, prob, fill = cluster)) +
  geom_col(width=1.0) +
  facet_grid(~pop, scales = 'free', space = 'free', switch = "x") +
  scale_y_continuous(expand = c(0, 0)) +
  scale_x_discrete(expand = expand_scale(add = 1)) +
  theme(panel.spacing=unit(0.1, "lines"), axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), legend.position = "none", axis.title.y=element_blank(), strip.background = element_rect(color = "white", fill = "white")) +
  scale_fill_manual(values = viridisLite::viridis(5))



cowplot::plot_grid(a,b,c,d, ncol=1)

STRUCTURE Results Summary

  1. Regardless of number of putative ancestral genetic clusters (k) modeled, there was a high degree of admixture within individuals, consistent with the observation of high gene flow among a priori assigned groupings.
  2. at K=3 and greater, there is a clear summer associated genetic cluster with highly variable amounts of introgression of this genetic cluster into fall and half pounder assigned fish, and to a lesser extent winter run fish
  3. Similar to the results from k means clustering (see dapc below), as k increases, there is decreasing shared cluster membership among winter and summer run fish, while half pounder and fall run fish continue to split evenly among clusters

PCA

Below we perform an unconstrained ordination of genetic data for both the neutral and the full datasets using PCA (implemented in ade4 and adegenet)

Neutral PCA

first we run pca only on the neutral dataset

neutral_genind <- genind_2.0[loc=neutral_loci_names]

#replace missing data using mean allele frequency
neutral_genind_scale <- scaleGen(neutral_genind, NA.method="mean")

# run the PCA, dudi.pca uses an interactive prompt to choose pcs to retain, we retain all in order to run some analyses on eigenvalues
pca1 <- dudi.pca(neutral_genind_scale,cent=FALSE,scale=FALSE, scannf = FALSE, nf = 332)
barplot(pca1$eig[1:332],main="PCA eigenvalues")

s.class(pca1$li, pop(neutral_genind),xax=1,yax=2, col=transp(viridisLite::viridis(4),0.7), axesell=FALSE, cstar=0, cpoint=1, grid=FALSE)
add.scatter.eig(pca1$eig[1:10],nf=3,xax=1,yax=2)

Let’s also make an interactive 3d plot, since the third PC is not much less informative than the second

plot_ly(x=pca1$li$Axis1, y=pca1$li$Axis2, z=pca1$li$Axis3, type="scatter3d", mode="markers", color=neutral_genind$pop, alpha = 0.8)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

As suggested by our Fst results, there is little differentiation at neutral markers. Also of note is a halfpounder outlier along the first pc. It should not come as a surprise that pca fails to differentiate among populations given the fst. PCA should not be able to differentiate among pops when fst < 1/(sqrt(sample size X number of markers)), which in our case is about 3%, substantially greater than the fst calculated for neutral markers (patterson 2006)

Full PCA

#replace missing data using mean allele frequency
genind_scale <- scaleGen(genind_2.0, NA.method="mean")

# run the PCA, dudi.pca uses an interactive prompt to choose pcs to retain, we retain all in order to run some analyses on eigenvalues
pca1 <- dudi.pca(genind_scale,cent=FALSE,scale=FALSE, scannf = FALSE, nf = 332)
barplot(pca1$eig[1:332],main="PCA eigenvalues")

s.class(pca1$li, pop(genind_2.0),xax=1,yax=2, col=transp(viridisLite::viridis(4),0.7), axesell=FALSE, cstar=0, cpoint=1, grid=FALSE)
add.scatter.eig(pca1$eig[1:10],nf=3,xax=1,yax=2)

Let’s also make an interactive 3d plot, since the third PC is not much less informative than the second

plot_ly(x=pca1$li$Axis1, y=pca1$li$Axis2, z=pca1$li$Axis3, type="scatter3d", mode="markers", color=genind_2.0$pop, alpha = 0.8)

In the first three most important axes of genetic differentiation using the full dataset we observe complete overlap of fall and halfpounder fish, winter and summer fish are clearly separated, but the cloud of fall-halfpounder fish is inclusive of both of these groups. Three clusters of data are present in the data, (1) a winter-like cluster also including some fall run and half pounders, (2) a summer-like cluster also including some fall and half pounder fush and (3) a uniquely half-pounder fall run cluster positioned intermediate between the two.

DAPC

Here we pull a little harder to find some population structure. We conduct a discriminant analysis on PCA transformed genetic variation (DAPC).

k means

First we will use k-means clustering to infer the number of genetic clusters in the dataset without applying a priori assumptions about the number of clusters or population assignments in the sample datset. Then we will check if these groups match well with assigned pops before proceeding to the DAPC.

#k means clustering, keep a lot (330 pcs) (kmeans won't overfit with two many pcs)
#note the number of clusters was chosen interactively, the code below executes the clustering using the best k
clusts <- find.clusters(genind_2.0, n.pca = 330, choose.n.clust = FALSE)

#plot BIC
bic <- as.data.frame(cbind(c(1:length(clusts$Kstat)), clusts$Kstat))
ggplot(bic)+geom_point(aes(x=V1, y=V2))+geom_line(aes(x=V1, y=V2))+theme_classic()+xlim(c(0, 50))+xlab("k")+ylab("BIC")

clusts <- find.clusters(genind_2.0, n.pca = 330, n.clust = 2)
table(pop(genind_2.0), clusts$grp)
##              
##                 1   2
##   fall         58  99
##   halfpounder 307 336
##   Summer        0  42
##   Winter       40   0
clusts <- find.clusters(genind_2.0, n.pca = 330, n.clust = 3)
table(pop(genind_2.0), clusts$grp)
##              
##                 1   2   3
##   fall         42  66  49
##   halfpounder 251 169 223
##   Summer        0   0  42
##   Winter       33   7   0
clusts <- find.clusters(genind_2.0, n.pca = 330, n.clust = 4)
table(pop(genind_2.0), clusts$grp)
##              
##                 1   2   3   4
##   fall         54  26  26  51
##   halfpounder 138 202  90 213
##   Summer       14   0   0  28
##   Winter        0  32   8   0

The model fit is best at k=6, however, only minor improvements in fit are beyond k=4. BIC tends to decrease until best fit only in a perfect island model, in practice, best K is often found at the lowest K that is a substantial improvement from the last. Here we use k=4, but when other k were used, generally the same result: Never any overlap between winter and summer run fish, fall and halfpounders always distributed among the 4 groups.

This result fits with our previous findings, there is structure and high diversity within fall and halfpounders (see heterozygosity section) and substantial overlap between fall-halfpounders and both winter and summer fish, but not between winter and summer fish.

This creates a complex scenario for conducting the DAPC, should we use a priori assigned populations as the basis of our DA? Simulation stuides (eg miller et al 2020) suggests that at low fst, the ability of k means clustering to succesfully recapitulate real genetic clusters is low, however, DA on biologically inaccurate grouping variables is not meaningful… Let’s do both for now as they are both informative in different ways.

DAPC de novo

#first optimize the PCs retained based on the a.score, so run dapc on the full pcs
dapc_full_denovo <- dapc(genind_2.0, clusts$grp, n.pca = 330, n.da = 3 )
optim.a.score(dapc_full_denovo)

## $pop.score
## $pop.score$`330`
##         1         2         3         4 
## 0.2888350 0.2657692 0.3112903 0.2575342 
## 
## $pop.score$`1`
##          1          2          3          4 
##  0.4223301  0.8988462  0.1612903 -0.3366438 
## 
## $pop.score$`50`
##         1         2         3         4 
## 0.7014563 0.5165385 0.8459677 0.4167808 
## 
## $pop.score$`100`
##         1         2         3         4 
## 0.5936893 0.4653846 0.6959677 0.4113014 
## 
## $pop.score$`150`
##         1         2         3         4 
## 0.5058252 0.4176923 0.6233871 0.3650685 
## 
## $pop.score$`200`
##         1         2         3         4 
## 0.4567961 0.3892308 0.4983871 0.3452055 
## 
## $pop.score$`250`
##         1         2         3         4 
## 0.3932039 0.3188462 0.4104839 0.2945205 
## 
## $pop.score$`300`
##         1         2         3         4 
## 0.3199029 0.3038462 0.3612903 0.2434932 
## 
## 
## $mean
##       330         1        50       100       150       200       250       300 
## 0.2808572 0.2864557 0.6201858 0.5415858 0.4779933 0.4224049 0.3542636 0.3071331 
## 
## $pred
## $pred$x
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
## [109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
## [127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
## [145] 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
## [163] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## [181] 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## [199] 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216
## [217] 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
## [235] 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252
## [253] 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270
## [271] 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288
## [289] 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306
## [307] 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324
## [325] 325 326 327 328 329 330
## 
## $pred$y
##   [1] 0.2864557 0.2954996 0.3045385 0.3135666 0.3225785 0.3315685 0.3405309
##   [8] 0.3494602 0.3583507 0.3671969 0.3759930 0.3847336 0.3934129 0.4020254
##  [15] 0.4105654 0.4190274 0.4274057 0.4356947 0.4438887 0.4519823 0.4599696
##  [22] 0.4678452 0.4756035 0.4832387 0.4907453 0.4981177 0.5053503 0.5124374
##  [29] 0.5193734 0.5261527 0.5327698 0.5392188 0.5454944 0.5515908 0.5575024
##  [36] 0.5632237 0.5687489 0.5740726 0.5791890 0.5840925 0.5887776 0.5932386
##  [43] 0.5974699 0.6014659 0.6052210 0.6087295 0.6119859 0.6149844 0.6177196
##  [50] 0.6201858 0.6223795 0.6243057 0.6259715 0.6273840 0.6285503 0.6294776
##  [57] 0.6301729 0.6306434 0.6308962 0.6309384 0.6307771 0.6304194 0.6298724
##  [64] 0.6291433 0.6282391 0.6271670 0.6259341 0.6245475 0.6230143 0.6213417
##  [71] 0.6195366 0.6176063 0.6155579 0.6133985 0.6111351 0.6087750 0.6063252
##  [78] 0.6037927 0.6011849 0.5985087 0.5957712 0.5929797 0.5901411 0.5872626
##  [85] 0.5843514 0.5814145 0.5784590 0.5754921 0.5725208 0.5695524 0.5665938
##  [92] 0.5636523 0.5607348 0.5578486 0.5550008 0.5521984 0.5494486 0.5467585
##  [99] 0.5441352 0.5415858 0.5391159 0.5367251 0.5344116 0.5321735 0.5300088
## [106] 0.5279157 0.5258923 0.5239367 0.5220470 0.5202213 0.5184578 0.5167545
## [113] 0.5151095 0.5135210 0.5119870 0.5105058 0.5090753 0.5076937 0.5063592
## [120] 0.5050697 0.5038235 0.5026186 0.5014532 0.5003253 0.4992332 0.4981747
## [127] 0.4971482 0.4961517 0.4951833 0.4942411 0.4933232 0.4924278 0.4915529
## [134] 0.4906967 0.4898573 0.4890327 0.4882212 0.4874207 0.4866295 0.4858456
## [141] 0.4850671 0.4842921 0.4835188 0.4827453 0.4819696 0.4811899 0.4804044
## [148] 0.4796110 0.4788079 0.4779933 0.4771655 0.4763244 0.4754700 0.4746025
## [155] 0.4737220 0.4728285 0.4719223 0.4710033 0.4700718 0.4691278 0.4681714
## [162] 0.4672028 0.4662220 0.4652292 0.4642245 0.4632080 0.4621798 0.4611399
## [169] 0.4600886 0.4590260 0.4579520 0.4568669 0.4557708 0.4546637 0.4535459
## [176] 0.4524173 0.4512781 0.4501284 0.4489683 0.4477980 0.4466175 0.4454270
## [183] 0.4442265 0.4430162 0.4417962 0.4405666 0.4393275 0.4380790 0.4368212
## [190] 0.4355542 0.4342782 0.4329933 0.4316995 0.4303970 0.4290859 0.4277662
## [197] 0.4264382 0.4251019 0.4237574 0.4224049 0.4210444 0.4196765 0.4183015
## [204] 0.4169200 0.4155324 0.4141391 0.4127406 0.4113374 0.4099300 0.4085188
## [211] 0.4071042 0.4056868 0.4042669 0.4028452 0.4014219 0.3999976 0.3985727
## [218] 0.3971478 0.3957232 0.3942995 0.3928770 0.3914563 0.3900378 0.3886220
## [225] 0.3872093 0.3858002 0.3843952 0.3829947 0.3815991 0.3802091 0.3788249
## [232] 0.3774471 0.3760762 0.3747125 0.3733566 0.3720090 0.3706700 0.3693401
## [239] 0.3680199 0.3667097 0.3654101 0.3641215 0.3628443 0.3615790 0.3603261
## [246] 0.3590861 0.3578594 0.3566464 0.3554477 0.3542636 0.3530946 0.3519404
## [253] 0.3508006 0.3496749 0.3485630 0.3474646 0.3463792 0.3453065 0.3442461
## [260] 0.3431978 0.3421612 0.3411359 0.3401216 0.3391179 0.3381245 0.3371411
## [267] 0.3361672 0.3352026 0.3342468 0.3332997 0.3323607 0.3314295 0.3305059
## [274] 0.3295894 0.3286797 0.3277765 0.3268794 0.3259881 0.3251022 0.3242213
## [281] 0.3233452 0.3224734 0.3216057 0.3207416 0.3198809 0.3190232 0.3181681
## [288] 0.3173152 0.3164643 0.3156151 0.3147670 0.3139199 0.3130732 0.3122268
## [295] 0.3113803 0.3105332 0.3096853 0.3088363 0.3079856 0.3071331 0.3062785
## [302] 0.3054216 0.3045627 0.3037018 0.3028389 0.3019742 0.3011077 0.3002395
## [309] 0.2993696 0.2984982 0.2976253 0.2967510 0.2958754 0.2949984 0.2941203
## [316] 0.2932411 0.2923609 0.2914797 0.2905976 0.2897147 0.2888310 0.2879467
## [323] 0.2870618 0.2861763 0.2852905 0.2844043 0.2835177 0.2826310 0.2817441
## [330] 0.2808572
## 
## 
## $best
## [1] 60
#nice now we use the optimized a score to run our dapc for real
dapc_full_denovo <- dapc(genind_2.0, clusts$grp, n.pca = 57, n.da = 3 )

plot_data <- as.data.frame(cbind(dapc_full_denovo$ind.coord, genind_2.0$pop, clusts$grp))
colnames(plot_data) <- c("LD1", "LD2", "LD3", "pop", "grp")
plot_data$pop <- as.factor(plot_data$pop)
plot_data$grp <- as.factor(plot_data$grp)

ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color=pop))+scale_color_viridis_d()+theme_classic()+ggtitle("DAPC by cluster, color by a priori pop")

ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color=grp))+scale_color_viridis_d()+theme_classic()+ggtitle("DAPC by cluster, color cluster")

marker_loadings1 <- loadingplot(dapc_full_denovo$var.contr, axis=1,thres=.005, lab.jitter=1)

markers1 <- unique(substr(names(marker_loadings1$var.values),1,nchar(names(marker_loadings1$var.values))-2))

marker_loadings2 <- loadingplot(dapc_full_denovo$var.contr, axis=2,thres=.005, lab.jitter=1)

markers2 <- unique(substr(names(marker_loadings2$var.values),1,nchar(names(marker_loadings2$var.values))-2))

marker_mapping2 %>%
  filter(marker %in% markers1 | marker %in% markers2 ) %>%
  select(marker, `Presumed Type`)

As suspected, given the the fact that fall and halfpounder fish do not fall into a single genetic cluster, de novo assigned genetic cluters do not do a great job of differentiating among a priori assigned run-timing groups. When breaking the fish into four groups these groups are mostly defined by covariation at ~20 markers including run timing markers, neutral markers, and residency markers.

DAPC a priori

#first optimize the PCs retained based on the a.score, so run dapc on the full pcs
dapc_full_apriori <- dapc(genind_2.0, n.pca = 330, n.da = 3 )
optim.a.score(dapc_full_denovo)

## $pop.score
## $pop.score$`57`
##         1         2         3         4 
## 0.6548544 0.5392308 0.8604839 0.4232877 
## 
## $pop.score$`1`
##          1          2          3          4 
##  0.4223301  0.7900000  0.1612903 -0.2517123 
## 
## $pop.score$`5`
##         1         2         3         4 
## 0.9820388 0.6919231 0.9596774 0.2386986 
## 
## $pop.score$`10`
##         1         2         3         4 
## 0.9252427 0.6350000 0.9629032 0.3030822 
## 
## $pop.score$`15`
##         1         2         3         4 
## 0.8868932 0.6026923 0.9693548 0.3390411 
## 
## $pop.score$`20`
##         1         2         3         4 
## 0.8266990 0.5876923 0.9314516 0.3599315 
## 
## $pop.score$`25`
##         1         2         3         4 
## 0.7825243 0.5607692 0.9491935 0.3832192 
## 
## $pop.score$`30`
##         1         2         3         4 
## 0.7621359 0.5723077 0.9024194 0.3907534 
## 
## $pop.score$`35`
##         1         2         3         4 
## 0.7611650 0.5507692 0.8959677 0.3945205 
## 
## $pop.score$`40`
##         1         2         3         4 
## 0.7296117 0.5380769 0.8766129 0.4041096 
## 
## $pop.score$`45`
##         1         2         3         4 
## 0.7131068 0.5419231 0.8443548 0.4256849 
## 
## $pop.score$`50`
##         1         2         3         4 
## 0.6985437 0.5415385 0.8516129 0.4126712 
## 
## $pop.score$`55`
##         1         2         3         4 
## 0.6878641 0.5265385 0.8161290 0.4126712 
## 
## 
## $mean
##        57         1         5        10        15        20        25        30 
## 0.6194642 0.2804770 0.7180845 0.7065570 0.6994954 0.6764436 0.6689266 0.6569041 
##        35        40        45        50        55 
## 0.6506056 0.6371028 0.6312674 0.6260916 0.6108007 
## 
## $pred
## $pred$x
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
## [51] 51 52 53 54 55 56 57
## 
## $pred$y
##  [1] 0.2804770 0.4150841 0.5396417 0.6440189 0.7180845 0.7551267 0.7621107
##  [8] 0.7494209 0.7274415 0.7065570 0.6948917 0.6915292 0.6932933 0.6970074
## [15] 0.6994954 0.6983333 0.6941078 0.6881583 0.6818238 0.6764436 0.6730226
## [22] 0.6712277 0.6703917 0.6698471 0.6689266 0.6671353 0.6646691 0.6618962
## [29] 0.6591851 0.6569041 0.6553076 0.6541946 0.6532500 0.6521586 0.6506056
## [36] 0.6483736 0.6456361 0.6426643 0.6397294 0.6371028 0.6350010 0.6334226
## [43] 0.6323115 0.6316118 0.6312674 0.6311677 0.6309838 0.6303321 0.6288292
## [50] 0.6260916 0.6219784 0.6173198 0.6131887 0.6106581 0.6108007 0.6141979
## [57] 0.6194642
## 
## 
## $best
## [1] 7
#nice now we use the optimized a score to run our dapc for real
dapc_full_apriori <- dapc(genind_2.0, n.pca = 7, n.da = 3 )


plot_data <- as.data.frame(dapc_full_apriori$ind.coord)
plot_data$pop <- as.character(genind_2.0$pop)


ggplot(data=plot_data)+geom_point(aes(x=LD1, y=LD2, color = pop), alpha = 0.5, size = 2)+theme_classic()+scale_color_viridis_d()+stat_ellipse(aes(x=LD1, y=LD2, color = pop))

scatter(dapc_full_apriori)

marker_loadings1 <- loadingplot(dapc_full_apriori$var.contr, axis=1,thres=.01, lab.jitter=1)

markers1 <- unique(substr(names(marker_loadings1$var.values),1,nchar(names(marker_loadings1$var.values))-2))

marker_loadings2 <- loadingplot(dapc_full_apriori$var.contr, axis=2,thres=.01, lab.jitter=1)

markers2 <- unique(substr(names(marker_loadings2$var.values),1,nchar(names(marker_loadings2$var.values))-2))

marker_mapping2 %>%
  filter(marker %in% markers1 | marker %in% markers2 ) %>%
  select(marker, `Presumed Type`)
marker_mapping2 %>%
  filter(marker %in% markers2 ) %>%
  select(marker, `Presumed Type`)

DAPC largely discriminates among groups along the first discriminant axis. Fall and half pounder fish demonstrate little to no differentiation along any DA. The first DA strongly separates winter and summer run fish, and is driven by by 20 markers including run timing associated markers, residency vs anadromy markers and neutral markers. The second da catpures much less variation (5.53 fold less), and is driven largely by 10 markers all either neutral or associated with residency/anadromy. Interstingly two of the 2 of the 5 “neutral” annotated markers (Omy_LDHB-2…) have been associated with residency vs anadromy in an association study in the klickitat river (doi.org/10.1080/00028487.2011.588131).

Run Timing Markers

Let’s look specifically at run-timing associated markers

marker_mapping <- readxl::read_xlsx("./metadata/final_mapping_results.xlsx", sheet = 1)
run_timing_loci_names <- marker_mapping %>%
  filter(chr == "28" | CRITFC_chromosome == "28") %>%
  filter(str_detect(`Presumed Type`, 'run|Run')) %>%
  select(marker)

#different naming convention, lets fix
run_timing_loci_names <- str_replace(run_timing_loci_names$marker, "Omy28", "Chr28")

#run_timing_loci <- genos_2.2 %>%
#  select(Sample, Date, run, year, one_of(run_timing_loci_names))

#genind_pop <- seppop(genind_2.0)
#run_timing_fall <- genind_pop$fall[loc=run_timing_loci_names]
#run_timing_half <- genind_pop$halfpounder[loc=run_timing_loci_names]
#run_timing_winter <- genind_pop$Winter[loc=run_timing_loci_names]
#run_timing_summer <- genind_pop$Summer[loc=run_timing_loci_names]

Diagnostic Markers

Are any markers diagnostic (fixed or nearly fixed within a run-timing group)?

#because adegenet can be so difficult to use, let take advantage of popgenreport to collect our summary data
run_timing_counts <- allele.dist(genind_2.0[loc=run_timing_loci_names], mk.figures = FALSE)$count
## Warning: the following specified loci do not exist: Omy_RAD52458-17
#make into a dataframe
run_timing_counts <- as.data.frame(do.call(rbind, run_timing_counts))
colnames(run_timing_counts) <- c("fall_count", "half_count", "summer_count", "winter_count")
run_timing_counts$sum <- rowSums(run_timing_counts, na.rm = TRUE)

run_timing_freqs <- allele.dist(genind_2.0[loc=run_timing_loci_names], mk.figures = FALSE)$frequency
## Warning: the following specified loci do not exist: Omy_RAD52458-17
#make into a dataframe
run_timing_freqs <- as.data.frame(do.call(rbind, run_timing_freqs))

run_timing_freqs <- as.data.frame(cbind(run_timing_freqs, run_timing_counts))

##### get only minor allele
#then put the marker names in the same order as the allele.dist (exlude the missing marker first)
run_timing_loci_names <- run_timing_loci_names[run_timing_loci_names != "Omy_RAD52458-17"]
run_timing_freqs$marker <- rep(run_timing_loci_names[order(run_timing_loci_names)], each = 2)
#now group by marker and keep the minor allele, then convert counts to 
run_timing_maf <- run_timing_freqs %>%
  group_by(marker) %>%
  slice_min(sum) %>%
  replace(., is.na(.), 0)

#plot as a heatmap
col_pal<- colorRampPalette(c("red", "white", "blue"))(256)
tmat <- t(as.matrix(run_timing_maf[,1:4]))
colnames(tmat) <- run_timing_maf$marker
pheatmap(tmat)

Of the 14 markers with known run-timing associations, 10 are fixed or nearly fixed for alternative alleles in winter and summer run fish. Fall run fish and half pounder demonstrate intermediate genotypes and very little differentiation from one another.

Stray thought / hypothesis collecter

Intermediate freq Ch28 How is it possible to have intermediate maf within the Ch28 region at a small subset of markers when the rest are nearly fixed within the region within winter/early summer runs, suggests either (a) multiple haplotypes (more than two) segregating within the region, (b) two separate haplotype blocks with the “RoSA” that are not coinherited, (c) multiple haplotypes spearating within the steelehad at the “RoSA” (i.e. some Ch28 markers are not stringly associated with run timing in all pops) (d) major genotyping error. Need to go in and check out the allele correction values / genotyping plot for these markers: Chr28_11607954, Chr28_11632591, Ch28_11773194, Omy_GREB1_09

  • checked out the genotyping plots for these markers, nothing looks out of whack. however, Omy_GREB1_09 is a poorly mapping marker by both SFGL and CRITFC mapping results

Neutral structure outliers (strays) in halfpounders There is evidence that half pounders may return to Rogue despite being native to a different stream, then return to the their natal stream during their subsequent adult spawning run(s). Do we see evidence of any half-pounders from other streams (i.e. outlier fish at neutral loci). This behavior is also predicted as a consequence of the proposed explanation of the half pounder lifestyle as a behavioral adaptation to avoid a recurring region of high SSTs near the Klamath and Rogue basins

Quote from Eversest 1971: Additional evidence of this behavior was noted when returns from “half-pounders” tagged in 1968 were analyzed. Thirty-one fish were tagged in one seine-haul on August 14, 1968, and five were recaptured in the Rogue the sane year. On subsequent migrations in 1969 and 1970, only one of the fish was recaptured in the Rogue while two were taken outside the system, one in the Smith River, and one from the Trinity River, both in northern California.